ABSTRACT

The database of genotypes and phenotypes (dbGaP) 
developed by the National Center for Biotechnology 
Information (NCBI) is a resource that contains 
information on various genome-wide association studies 
(GWAS) and is currently available via NCBI's dbGaP 
Entrez interface. The database is an important resource, 
providing GWAS data that can be used for new 
exploratory research or cross-study validation by 
authorized users. However, finding studies relevant to a 
particular phenotype of interest is challenging, as 
phenotype information is presented in a non- 
standardized way. To address this issue, we developed 
PhenDisco (phenotype discoverer), a new information 
retrieval system for dbGaP. PhenDisco consists of two 
main components: (1) text processing tools that 
standardize phenotype variables and study metadata, 
and (2) information retrieval tools that support queries 
from users and return ranked results. In a preliminary 
comparison involving 18 search scenarios, PhenDisco 
showed promising performance for both unranked and 
ranked search comparisons with dbGaP's search engine 
Entrez. The system can be accessed at http://pfindr.net. 



INTRODUCTION 

The database of genotypes and phenotypes 
(dbGaP) is an important repository for data gener- 
ated through various genome-wide association 
studies (GWAS), which can be used for new 
explorations or cross-study validation. 1-3 In add- 
ition to genomic data, dbGaP requires investigators 
to submit phenotype data. As of 7 July 2013, 
dbGaP contained 422 studies, including more than 
130 000 phenotype variables. However, searching 
relevant studies accurately and completely is chal- 
lenging, because phenotypic information related to 
studies is often stored in a non-standardized way. 
For particular queries, the dbGaP Entrez system 
returns several studies that are not always relevant, 
and it does not make clear how particular records 
are selected and why they appear in a particular 
order. Consequently, users have to review each 
study description carefully to determine relevancy, 
which can become a laborious and time-consuming 
task when many studies are retrieved. 

To address this issue, we developed a new web- 
based information retrieval system called PhenDisco 
(phenotype discoverer) based on the user require- 
ments obtained by interviewing dbGaP users. The 
project is funded through the program entitled 
phenotype finder in data resources (pFINDR) from
the National Heart, Lung, and Blood Institute. The
goal of this program is to facilitate the search of phe- 
notypes in dbGaP's GWAS. Our approach uses 
natural language processing (NLP) as well as infor- 
mation retrieval techniques in order to improve 
phenotype search in dbGaP.

There are several related works that aim to 
address issues associated with the lack of standard- 
ization in phenotype variables. 3-9 PhenX defined 
287 frequently used phenotypes (called measures) 
in 21 research domains, and manually cross- 
mapped these measures to phenotype variables in 
16 dbGaP studies. 3 4 The goal is to use these mea- 
sures prospectively, so new studies are described in 
a standardized way. Another project, eMERGE, 
used a semi-automated process: users manually 
search for phenotype variables for specific domains 
(eg, Alzheimer's disease), and these variables are 
automatically mapped to standardized vocabularies 
through a tool called eleMAP. eleMAP outputs are
then further curated by users before results can be 
interpreted. 8 9 Our group was involved in similar 
work that annotated phenotypes in the gene 
expression omnibus (GEO), 10 a public gene expres- 
sion data repository. Human annotators reviewed 
the papers published using the data available in 
GEO, then manually identified the phenotype vari- 
ables and mapped them to the National Cancer 
Institute thesaurus. 5-7 Although the results of such 
manual or semi-automated mapping processes tend 
to be reliable and accurate for small data, the tech- 
nique is not scalable. Therefore, we developed an 
algorithmic approach to process the large amount of 
phenotype variables in dbGaP for standardization. 

METHODS 

PhenDisco consists of two main components: 
(1) text processing tools that standardize both 
phenotype variables and study metadata, and (2) 
information retrieval tools that support queries 
from users and return ranked results. Below we 
describe each component. 

Data collection and standardization 

We collected information about the GWAS and their 
phenotype variables from two publicly available 
dbGaP sources: (1) dbGaP web pages (http://www.ncbi.nlm.nih.gov/gap),
and (2) the dbGaP FTP site (ftp://ftp.ncbi.nlm.nih.gov/dbgap). The
dbGaP web pages contain study-level information such as study ID,
title, description, and platforms, and the dbGaP FTP site contains
phenotypic information such as phenotype ID, phenotype description and
associated statistics. We developed a crawler to download both 
types of data. We analyzed 422 studies, which contained 130 000 
variables. 

Study-level metadata generation 

Given that the number of new studies being added every month 
is small, we focused on automating the standardization of vari- 
ables, while the abstraction of study data itself was only partially 
automated. Portions of the study-level metadata are well struc- 
tured and amenable to full automatic parsing. Study ID, title, 
number of participants, and study design are automatically 
extractable study data. We extracted, through manual review, 
study data such as topic diseases, consent type, institutional 
review board status, and study locations. 11 12 To standardize the 
study information, the topic diseases were mapped to the 
unified medical language system (UMLS)'s concept unique iden- 
tifiers. 13 We adopted UMLS as a controlled vocabulary in this 
project based on its comprehensive domain coverage and wide- 
spread use in biomedical NLP systems. 14 15 In addition, we 
mapped study locations to ISO 3166-2 country subdivision 
code, 16 for example, US-AZ (USA — Arizona). 
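A study-level record of the kind described above might be represented as follows. This is a minimal sketch, not the PhenDisco schema: the class name `StudyRecord`, its field names, and the placeholder study ID are illustrative assumptions, and the checks only test the format of the identifiers, not whether they exist in UMLS or ISO 3166-2.

```python
import re
from dataclasses import dataclass, field

# ISO 3166-2 format: two-letter country code, hyphen, 1-3 alphanumerics (e.g. US-AZ).
ISO_3166_2 = re.compile(r"^[A-Z]{2}-[A-Z0-9]{1,3}$")
# UMLS concept unique identifier format: the letter C followed by seven digits.
UMLS_CUI = re.compile(r"^C\d{7}$")


@dataclass
class StudyRecord:
    """Hypothetical container for standardized study-level metadata."""
    study_id: str
    title: str
    topic_disease_cuis: list = field(default_factory=list)
    locations: list = field(default_factory=list)

    def validate(self):
        """Return a list of format problems (empty if the record looks well formed)."""
        problems = []
        for cui in self.topic_disease_cuis:
            if not UMLS_CUI.match(cui):
                problems.append(f"bad CUI: {cui}")
        for loc in self.locations:
            if not ISO_3166_2.match(loc):
                problems.append(f"bad ISO 3166-2 code: {loc}")
        return problems


# Placeholder study ID and title; C0004096 is the UMLS CUI for asthma.
record = StudyRecord("phs000000", "Example asthma study", ["C0004096"], ["US-AZ"])
print(record.validate())
```

Storing coded values rather than free text is what lets a search over locations or topic diseases match consistently across studies.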

Phenotype variable standardization 

The task of phenotype variable standardization has been the 
most interesting, yet most challenging, part of developing 
PhenDisco. The lack of a uniform naming convention meant 
that, for a study containing thousands of phenotype variables, 
idiosyncratic choices introduced unnecessary variation and 
redundancy across studies. For example, the same variable 'body 
weight' can be represented as 'weight' (variable id: 
phv00173256.v1.p1), 'WGHT' (variable id: phv00169068.v2.p1),
and 'FB9' (variable id: phsv00001189.v1.p7). Therefore,
variable descriptions, which provide more information than 
variable names, are more useful for the task of standardization. 
The lack of standardization is a well-known problem in clinical 
informatics; standards and information models, such as the clin- 
ical elements model (CEM), were designed to address this issue. 
The CEM worked reasonably well for clinical variables in electronic
medical records, but did not address clinical research variables in
dbGaP. 17 While standards such as the observational
medical outcome partnership (OMOP) model 18 19 cover many 
of these variables, given our experience mapping variables into 
OMOP for a very limited set of conditions, 20 we realized that 
the variables in dbGaP studies were described in less detail and 
determined that it would be more cost-effective and scalable to 
map them into a simpler model. 21 22 We briefly describe our 
approach as follows. 

We developed an information model including four major 
information classes: 'theme' (ie, age, gender, race, ethnicity), 
'subject', 'event', and 'linkage' of information. 21 23 24 For 
example, the phenotype variable 'age Mom diagnosed — asthma' 
has theme age, subject 'mother', event 'asthma', and linkage of 
information 'diagnosed'. We wrote a simple NLP tool in Python 
called DIVER to identify and map phenotype variables into this 
model. The evaluation on 3565 variables from pulmonary 
studies in dbGaP showed that DIVER achieved 98% recall and 
94% precision in identifying variables related to demographic 
concepts and 79% correct mapping into the information 
model. 23 
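To make the four-class information model concrete, the following sketch maps a variable description into theme, subject, event, and linkage slots using small keyword lexicons. It is not DIVER: the lexicons, the function name `map_variable`, and the rule that unclaimed words become the event are all assumptions chosen for illustration.

```python
import re

# Tiny hand-made lexicons (assumptions); a real system needs far larger vocabularies.
THEMES = {"age": "age", "gender": "gender", "sex": "gender",
          "race": "race", "ethnicity": "ethnicity"}
SUBJECTS = {"mom": "mother", "mother": "mother", "dad": "father",
            "father": "father", "sibling": "sibling"}
LINKAGES = {"diagnosed": "diagnosed", "onset": "onset", "died": "died"}


def map_variable(description):
    """Map a variable description to theme, subject, event, and linkage slots."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    slots = {"theme": None, "subject": None, "event": None, "linkage": None}
    leftover = []
    for tok in tokens:
        if tok in THEMES and slots["theme"] is None:
            slots["theme"] = THEMES[tok]
        elif tok in SUBJECTS and slots["subject"] is None:
            slots["subject"] = SUBJECTS[tok]
        elif tok in LINKAGES and slots["linkage"] is None:
            slots["linkage"] = LINKAGES[tok]
        else:
            leftover.append(tok)
    # Whatever no lexicon claims is treated as the event.
    slots["event"] = " ".join(leftover) or None
    return slots


print(map_variable("age Mom diagnosed - asthma"))
```

For the example in the text, this yields theme 'age', subject 'mother', event 'asthma', and linkage 'diagnosed'. A keyword approach like this breaks down quickly on abbreviations and multi-word events, which is one reason a dedicated NLP tool is needed.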

For variables that were not related to demographic concepts, 
we identified two categories of variables: 'topic' and 'subject of 
information'. The 'topic' is the main theme of phenotype variables,
while the 'subject of information' is the individual experiencing the
variable. For example, the phenotype variable
'father diagnosed with lung cancer' has subject of information 
'father' and topic 'lung cancer'. We first tagged 'topic' and 
'subject of information' terms from each variable description, 
and then mapped those terms to the UMLS metathesaurus. 13 
This process was automatically implemented by our customized 
NLP tool. Further standardization of these variables based on 
information modeling and NLP is in progress. 21 
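The tagging step can be sketched as a longest-match dictionary lookup. The code below is an illustrative assumption rather than the customized NLP tool described above: the vocabularies are toy examples, and the concept identifiers are placeholders, not real UMLS concept unique identifiers.

```python
# Toy vocabularies (assumptions). The concept identifiers are placeholders,
# not real UMLS CUIs.
SUBJECT_TERMS = {"father": "father", "mother": "mother", "sibling": "sibling"}
TOPIC_CONCEPTS = {
    ("lung", "cancer"): "PLACEHOLDER_LUNG_CANCER",
    ("cancer",): "PLACEHOLDER_CANCER",
    ("asthma",): "PLACEHOLDER_ASTHMA",
}
MAX_PHRASE_LEN = max(len(key) for key in TOPIC_CONCEPTS)


def tag_variable(description):
    """Return (subject, [(topic phrase, concept id), ...]) for a description."""
    tokens = description.lower().split()
    subject = next((SUBJECT_TERMS[t] for t in tokens if t in SUBJECT_TERMS), None)
    topics = []
    i = 0
    while i < len(tokens):
        # Try the longest phrase first so 'lung cancer' wins over 'cancer'.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in TOPIC_CONCEPTS:
                topics.append((" ".join(phrase), TOPIC_CONCEPTS[phrase]))
                i += n
                break
        else:
            # No phrase starts at this token; move on.
            i += 1
    return subject, topics


print(tag_variable("father diagnosed with lung cancer"))
```

Preferring the longest match keeps a specific concept such as 'lung cancer' from being split into a more general one.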

Information retrieval and ranking algorithm 

The information retrieval tool consists of two parts: a query 
parser and a ranking algorithm. 

Query parser 

We utilized pyparsing 25 — a toolkit written in Python — for 
parsing queries in PhenDisco. The role of a query parser is to 
take an input query and break it into its respective terms and 
operators. Search terms can be a single word or whole phrases, 
connected by operators (ie, AND, OR, NOT). To improve 
search performance, we expanded each input query to include 
synonyms by integrating MetaMap 26 into the query parser. This 
concept-based search is the default search mode of PhenDisco 
(see figure 1). 
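The parsing and expansion steps can be illustrated with a small recursive-descent parser. This sketch uses only the Python standard library rather than pyparsing, and a hard-coded synonym table rather than MetaMap. The operator precedence (NOT, then AND, then OR), the tuple-based tree format, and all function names are assumptions made for the example.

```python
import re

# Hypothetical synonym table standing in for concept-based expansion.
SYNONYMS = {
    "myocardial infarction": ["heart attack", "mi"],
    "copd": ["chronic obstructive pulmonary disease"],
}

# A token is a single-quoted phrase, a double-quoted phrase, or a bare word.
TOKEN = re.compile(r"'([^']+)'|\"([^\"]+)\"|(\S+)")


def tokenize(query):
    """Split a query into ('TERM', text) and ('OP', name) tokens."""
    tokens = []
    for single, double, word in TOKEN.findall(query):
        if word in ("AND", "OR", "NOT"):
            tokens.append(("OP", word))
        else:
            tokens.append(("TERM", (single or double or word).lower()))
    return tokens


def expand(term):
    """Turn a term into an OR over the term and its synonyms, if any."""
    node = ("TERM", term)
    for synonym in SYNONYMS.get(term, []):
        node = ("OR", node, ("TERM", synonym))
    return node


def parse(tokens):
    """Build a nested-tuple tree; assumed precedence is NOT > AND > OR."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():
        node = parse_and()
        while peek() == ("OP", "OR"):
            advance()
            node = ("OR", node, parse_and())
        return node

    def parse_and():
        node = parse_not()
        while peek() == ("OP", "AND"):
            advance()
            node = ("AND", node, parse_not())
        return node

    def parse_not():
        if peek() == ("OP", "NOT"):
            advance()
            return ("NOT", parse_not())
        kind, value = peek()
        if kind != "TERM":
            raise ValueError("expected a search term")
        advance()
        return expand(value)

    tree = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos][1]!r}")
    return tree


print(parse(tokenize("COPD AND FVC")))
```

In this sketch, multi-word phrases must be quoted, as in the queries of table 1; two bare words in a row raise an error instead of being joined implicitly.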

Ranking algorithm 

We used the BM25F ranking algorithm, 27 28 as it is one of the 
most popular ranking algorithms for structured documents. 
BM25F is a modified tf-idf (term frequency — inverse document 
frequency) algorithm 29 that has been shown to enhance per- 
formance when dealing with documents composed of several 
fields such as title, headline, and main text. 30 31 We considered each
study using the different fields identified in the study abstraction 
process, such as title, study description, or topic disease, along 
with standardized phenotypes. In this first version of 
PhenDisco, we considered terms from different fields to be 
equally important, and we plan to analyze user searches and 
rankings to assign appropriate weights for these terms in the 
next version of the software. We utilized Whoosh, 32 a search 
library, to implement the BM25F algorithm. The system compo- 
nents are depicted in figure 2. The system is implemented in 
Linux Ubuntu OS 64-bit using 32GB RAM, running MySQL 
V14.14 on an Apache V2.2.20 web server and is available at 
http://pfindr.net. 
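The idea behind field-aware ranking can be sketched in a few lines of plain Python. This is a simplified BM25F variant written for illustration, not the Whoosh implementation used in the system: the function name `bm25f_rank`, the parameter defaults `k1=1.2` and `b=0.75`, the idf formula, and the example studies are all assumptions. Field weights default to 1.0, mirroring the equal weighting described above.

```python
import math


def bm25f_rank(docs, query_terms, field_weights=None, k1=1.2, b=0.75):
    """Rank documents (dicts of field -> text) with a simple BM25F score.

    Returns a list of (doc_id, score) pairs, best first. Query terms are
    single lowercase words; phrase matching is not handled in this sketch.
    """
    fields = sorted({f for d in docs.values() for f in d})
    weights = field_weights or {}
    # Tokenize every field of every document once.
    tokens = {doc_id: {f: d.get(f, "").lower().split() for f in fields}
              for doc_id, d in docs.items()}
    n_docs = len(docs)
    avg_len = {f: (sum(len(t[f]) for t in tokens.values()) / n_docs) or 1.0
               for f in fields}

    # Inverse document frequency, counting a document once per term.
    idf = {}
    for term in query_terms:
        df = sum(any(term in t[f] for f in fields) for t in tokens.values())
        idf[term] = math.log(1 + (n_docs - df + 0.5) / (df + 0.5)) if df else 0.0

    scores = {}
    for doc_id, doc_fields in tokens.items():
        score = 0.0
        for term in query_terms:
            # Field-weighted, length-normalized pseudo term frequency.
            tf = 0.0
            for f in fields:
                raw = doc_fields[f].count(term)
                if raw == 0:
                    continue
                norm = 1 - b + b * len(doc_fields[f]) / avg_len[f]
                tf += weights.get(f, 1.0) * raw / norm
            # Saturate the combined frequency once, after merging fields.
            score += idf[term] * tf / (k1 + tf)
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical studies with two fields each.
studies = {
    "s1": {"title": "asthma genetics study", "description": "asthma in children"},
    "s2": {"title": "heart study", "description": "includes asthma history"},
    "s3": {"title": "diabetes cohort", "description": "glucose"},
}
ranked = bm25f_rank(studies, ["asthma"])
print(ranked)
```

The feature that separates BM25F from running BM25 on each field separately is that term frequencies are combined across fields before the saturation step, so a term appearing in both title and description is not counted as two independent matches.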

Key system features 

Currently, PhenDisco supports basic keyword searches and 
offers the following features that are not supported in dbGaP
Entrez: 

► Auto-complete: auto-completion of search term function was 
integrated with the search box, using the phenotype terms 
collected from the GWAS catalog. 33 

► Concept-based search: search term expansion by synonym 
based on UMLS metathesaurus mapping. 

► Highlighted search keywords: the terms relevant to the 
search keywords are highlighted in the search result display. 

► Ranked results: returned studies are displayed in ranked 
order, determined by the BM25F algorithm. 

► Customization of the result display: users can select the study-level
metadata, such as title, study type, and platform, to display
with the search results. Users can select and export results to 
the comma-separated values format. 
Figure 1 Screenshot of the PhenDisco system. The top panel contains a search input box with concept-based search (ie, expandable terms) as the
default. 



EVALUATION 

Gold standard dataset 

A domain expert developed 18 search scenarios related to particular
cardiopulmonary conditions. Search scenarios could include disease
names such as 'asthma' or 'myocardial infarction' in combination with
demographics such as 'African American' and/or a clinical attribute
such as 'FVC' (forced vital capacity).
The list of queries used for evaluation is listed in table 1. Use 



Figure 2 Components of the 
PhenDisco system: (1) sdGaP 
(semantic-driven genotypes and 
phenotype) database contains 
standardized phenotype variables and 
study metadata from dbGaP, and 
(2) information retrieval tools that 
parse input queries, map into 
information model and return ranked 
studies. sdGaP consists of data from 
dbGaP that are mapped into our 
information model, as well as study 
meta-data. 
Table 1 List of 18 user-defined queries used for pilot evaluation

Case no.  Query
1         Asthma
2         Asthma AND 'African American'
3         Asthma AND 'African American' AND Hispanic
4         Asthma AND 'African American' AND 'skin test'
5         Asthma AND 'African American' AND Hispanic AND 'skin test'
6         Asthma AND 'African American' AND FEV1
7         Asthma AND 'African American' AND Hispanic AND FEV1
8         Asthma AND 'skin test'
9         COPD
10        'Chronic obstructive pulmonary disease' AND Caucasian
11        'Chronic obstructive pulmonary disease' AND Caucasian AND 'high cholesterol'
12        COPD AND hypercholesterolemia
13        COPD AND FVC
14        'Chronic obstructive pulmonary disease' AND Caucasian AND FVC
15        'Myocardial infarction'
16        'Myocardial infarction' AND black
17        MI AND BMI
18        'Myocardial infarction' AND black AND BMI



Table 2 Information retrieval performance of PhenDisco versus
dbGaP on 18 user case queries

System        Precision  Recall  F-measure  MRP (top 5)  MAP
dbGaP Entrez  0.0756     0.5278  0.1321     0.0600       0.0756
PhenDisco     0.3000     0.9722  0.4552     0.4000       0.2971



MRP (top 5) is mean rank precision at top five retrieved studies, MAP is mean 
average precision. 



PhenDisco performance 

Our evaluation of PhenDisco and dbGaP Entrez was conducted 
on 10 January 2013. The results are shown in table 2 (see more 
details in supplementary appendix 2, available online only). For 
the limited number of queries that were evaluated, PhenDisco 
had substantially better performance than dbGaP Entrez, with an 
F-measure of 0.4552 versus 0.1321 for the unranked evaluation. 
When ranking was considered for the top five returns, PhenDisco 
also showed better performance than dbGaP Entrez with the 
MRP of 0.40 versus 0.06, and MAP of 0.2971 versus 0.0756. 

A preliminary evaluation of usability from three real dbGaP 
users indicated that PhenDisco fully satisfied the usability 
requirements they put forward (see more details in supplemen- 
tary appendix 3, available online only). 



cases were determined based on presumed clinical relevance, 
clinical interest, and potential future research impact. For 
example, in regard to use cases 1-9, 'asthma' was chosen 
because of its widespread prevalence. 34 

The domain expert then manually reviewed all dbGaP studies 
and created the gold standard for each search scenario according 
to the following steps: 

1. review the entire set of dbGaP studies and find studies that were
relevant to 'disease' keywords (eg, 'asthma'),

2. review all information resources (ie, study description,
phenotype variable description) related to the retrieved
studies, and

3. find studies that include demographic information (eg,
'African American') and a clinical attribute (eg, FVC).



Evaluation metrics 

We conducted a preliminary evaluation of the system using 
standard information retrieval measurements: precision, recall 
and F-measure for unranked studies. 35-37 For relevancy ranking, 
we used two measures: mean rank precision (MRP) and mean 
average precision (MAP). They are widely used in information 
retrieval evaluation for both general and biomedical texts. 38-41
MRP is the mean value of the precisions computed over all 
queries at a certain cut-off rank. MAP is the mean value of the 
average precisions for each rank computed for all queries. 
Average precision is calculated as follows: 



\[
\text{Average precision} = \frac{\sum_{i=1}^{n} \bigl(\mathrm{precision}(i) \times \mathrm{rel}(i)\bigr)}{\text{number of relevant studies}}
\]



Here n is the number of returned documents; precision(i) is the
precision at rank i, and rel(i) is an indicator function at rank i: it
equals 1 if the corresponding study is relevant, and 0 otherwise.
In our evaluation we chose the cut-off rank to be 5, which is a
frequently selected cut-off point. 30 38-40
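The metrics defined above can be computed per query as in the following sketch; MRP and MAP are then the means of `precision_at_k` and `average_precision` over all queries. The function names and the example study identifiers are assumptions, and `precision_at_k` follows the common convention of dividing by the cut-off k even when fewer than k studies are returned, which the text does not specify.

```python
def precision_recall_f(retrieved, relevant):
    """Unranked precision, recall, and F-measure for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def precision_at_k(ranked, relevant, k=5):
    """Fraction of the top k ranked studies that are relevant (divides by k)."""
    relevant = set(relevant)
    return sum(1 for study in ranked[:k] if study in relevant) / k


def average_precision(ranked, relevant):
    """Sum of precision(i) * rel(i) over ranks, divided by the number of relevant studies."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for i, study in enumerate(ranked, start=1):
        if study in relevant:
            hits += 1
            total += hits / i  # precision at rank i, counted only when rel(i) = 1
    return total / len(relevant)


def mean(values):
    """Arithmetic mean, used to turn per-query values into MRP or MAP."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical ranked output and gold standard for a single query.
ranked = ["a", "b", "c", "d", "e"]
gold = {"a", "c", "x"}
print(precision_recall_f(ranked, gold))
print(precision_at_k(ranked, gold, k=5))
print(average_precision(ranked, gold))
```

Because the average precision denominator is the number of relevant studies rather than the number retrieved, a relevant study that is never returned (such as 'x' here) lowers the score.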



DISCUSSION 

PhenDisco achieved higher recall and precision than dbGaP in 
both unranked and ranked results in this pilot evaluation. Through 
error analysis, we found that dbGaP's low precision was mainly 
due to its acceptance of search terms that appear in any text in any 
part of the study, including less relevant contexts such as exclusion 
criteria or titles of papers referenced in the study description. On
the other hand, the main reason for the low recall of dbGaP Entrez 
is the lack of standardization of phenotype information. In other 
words, dbGaP Entrez only supported string-based search, thus 
search terms such as 'myocardial infarction' were not expanded 
into synonyms or acronyms such as 'heart attack' and 'MI'. The
fact that dbGaP Entrez returns unranked results accounts for that 
system's low performance in the relevance ranking evaluation. 

Precision in PhenDisco was higher than in dbGaP Entrez, but 
was still lower than expected. This may have resulted from the util- 
ization of too stringent a criterion to consider a particular study as 
being 'relevant' for the search. The domain expert was focused on 
the primary goals of the studies for this formative evaluation, and 
not on the availability of the phenotype in general (eg, if 'asthma' 
was not a main subject for a study, then the domain expert consid- 
ered the study not to be relevant, although the study might have 
contained individuals with that phenotype and hence it would not 
be necessarily a false positive). In the comparison between Entrez 
and PhenDisco, however, using a stringent criterion affected both 
systems equally. In future work we will investigate the appropriate- 
ness of using a less stringent criterion to categorize studies into 
relevant or not relevant for a particular search. We believe that the 
best way to categorize may be to obtain direct feedback from 
users. For example, by unselecting studies that appear in the 
output, users are indicating that they are irrelevant for their 
searches. Once we collect data from a large number of users, we 
will be able to enhance our system and provide more accurate pre- 
cision and recall estimates. 

PhenDisco may be a good alternative to dbGaP Entrez for scien- 
tists who need to identify studies that contain the phenotypes they 
are interested in. Some advantages of PhenDisco over dbGaP 
Entrez are: (1) PhenDisco integrates NLP tools to enhance query 
processing and phenotype variable mapping; (2) PhenDisco augments
background knowledge from domain experts by adding
meta-data for the studies; and (3) PhenDisco's results are ranked 
in descending order of relevance. The main disadvantage of 
PhenDisco is that, unlike dbGaP Entrez, which relies on keyword 
search in any portion of a study document, PhenDisco's search is 
performed on study and variable descriptions only, based on meta- 
data that are produced by a process that is not fully automated. We 
use a curator to verify a large portion of the results of an auto- 
mated mapping process and to fix annotations as needed. Given 
our simple information model, it takes less than 30 min for a
curator to validate the majority of the meta-data and this is why 
we were able to annotate all studies in dbGaP with the help of 
part-time curators. As the number of new studies is relatively small
compared with the more than 400 studies that have already undergone
this process, the semi-automated process is scalable and is not a
bottleneck. We plan to
improve further the information model and mapping algorithm 
and use the same process to annotate phenotypes in GEO and 
other public data resources. 

In the future, we plan to add more features to the current system 
and keep our users updated by prominently displaying the changes on
the home page of PhenDisco's web site. These features include: (1)
improving the search performance, especially by integrating search 
queries with ontology expansions for concepts' children; (2) improv- 
ing PhenDisco's advanced search, by incorporating other types of 
study-level meta-data; and (3) providing efficient ways of identifying and
browsing similar phenotype variables collected across different 
studies using clustering techniques. We also plan to apply more 
sophisticated NLP techniques to improve precision of the system to 
account for detection of negated concepts and temporal relationships, 
and promote broader dissemination of the tool and meta-data 
through the iDASH National Center for Biomedical Computing. 42 

Correction notice This article has been corrected since it was published Online 
First. The last author's name was previously incorrect and has now been corrected. 